A Deep Learning Framework on Spatio-Temporal and Multi-Modal Features of the Video for Effective Facial Expressions and Stress Inference

Authors: Ankit Katakwar, Govind Singh

DOI Link: https://doi.org/10.22214/ijraset.2025.74284

Abstract

Current vision-based methods for stress detection rely on static facial expressions analysis and suffer from the problems of reliability, accuracy, and generalization. This study proposes a deep learning model for real-time stress detection based on multimodal spatio-temporal structural fusion. Our model utilizes a dual-stream CNN for obtaining frame spatial features and sequence temporal features. To complement this data, we also estimate heart rate variability (HRV) from remote photoplethysmography (rPPG) extracted from the facial video. Tested on the combined datasets of FER-2013, AffectNet, and DKEFS, the system identifies expressions with 88.5% accuracy and achieves approximately 25% better precision in stress inference than baseline single-mode CNN models. The system is accelerated by TensorRT to run in real time at over 30 FPS on a consumer-grade GPU.

Introduction

Stress is a vital biomarker for mental and physical health, and non-contact, automated stress detection has important applications in healthcare, automotive safety, and workplace wellness. Current vision-based stress detection systems face key challenges: analyzing static frames without temporal context, dataset biases limiting generalizability, oversimplified emotion-to-stress mappings, and lack of physiological validation.

This paper proposes a novel multi-modal system addressing these issues by combining:

A spatio-temporal CNN architecture capturing both facial features and their temporal dynamics from video sequences.
Multi-task learning to simultaneously classify facial expressions and estimate heart rate variability (HRV) non-contactly via remote photoplethysmography (rPPG).
Training on a large, diverse composite dataset to improve robustness.
A data fusion model integrating expression and physiological data to yield reliable, physiologically-grounded stress inference.

The model uses EfficientNet-B0 for spatial features, a 3D CNN (I3D) for temporal dynamics, and a CNN for rPPG signal extraction from the forehead. Features are fused and classified using machine learning models like Random Forest or SVM. The system runs in real-time (>30 FPS) optimized with NVIDIA TensorRT.

Evaluation will focus on expression recognition accuracy, stress classification metrics, and rPPG estimation quality, benchmarked against baseline models and ablation studies. Ethical considerations include privacy (on-device processing), bias mitigation, informed consent, and limitations related to occlusion, lighting, and motion artifacts.

Conclusion

This paper presents a methodologically rigorous framework that significantly advances the state-of-the-art in vision-based stress recognition. By integrating spatio-temporal expression analysis with non-contact physiology, we bridge the gap between computer vision and psychophysiology, moving beyond a simple proof-of-concept towards a robust and validated system. Future work will involve integrating Natural Language Processing (NLP) for a multi-modal analysis of speech content and paralinguistic features. Furthermore, we will explore developing personalized models that adapt to an individual\'s baseline behavior and physiological patterns.

References

[1] J. Goodfellow, D. Erhan, P. L. Carrier, A. Courville, M. Mirza, B. Hamner, W. Cukierski, Y. Tang, D. Thaler, D.-H. Lee, Y. Zhou, C. Ramaiah, F. Feng, R. Li, X. Wang, D. Athanasakis, J. Shawe-Taylor, M. Milakov, J. Park, R. Ionescu, M. Popescu, C. Grozea, J. Bergstra, J. Xie, L. Romaszko, B. Xu, Z. Chuang, and Y. Bengio, “Challenges in representation learning: A report on the machine learning contest of facial expression recognition,” 2013 IEEE Int. Conf. on Comput. Vis. Workshops, Sydney, NSW, Australia, 2013, pp. 1–9, doi: 10.1109/ICCVW.2013.59. [2] A. Mollahosseini, B. Hasani, and M. H. Mahoor, “AffectNet: A Database for Facial Expression, Valence, and Arousal Computing in the Wild,” IEEE Trans. Affect. Comput., vol. 10, no. 1, pp. 18–31, 1 Jan.-March 2019, doi: 10.1109/TAFFC.2017.2740923. [3] M. Tan and Q. V. Le, “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks,” Proc. 36th Int. Conf. Mach. Learn., Long Beach, California, USA, 2019, PMLR 97, pp. 6105–6114 [4] J. Carreira and A. Zisserman, “Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset,” 2017 IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Honolulu, HI, USA, 2017, pp. 4724–4733, doi: 10.1109/CVPR.2017.502 [5] R. R. Shah, A. Kumar, and M. S. Kankanhalli, \"Multi-Modal Fusion for Stress Detection in the Wild,\" *2021 IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW)*, Nashville, TN, USA, 2021, pp. 2446-2455, doi: 10.1109/CVPRW53098.2021.00276. [6] M.-Z. Poh, D. J. McDuff, and R. W. Picard, “Non-contact, automated cardiac pulse measurements using video imaging and blind source separation,” Opt. Express, vol. 18, no. 10, pp. 10762–10774, May 2010, doi: 10.1364/OE.18.010762 [7] R. W. Picard, E. Vyzas, and J. Healey, “Toward Machine Emotional Intelligence: Analysis of Affective Physiological State,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 10, pp. 1175–1191, Oct. 2001, doi: 10.1109/34.954607. [8] S. M. Pizer et al., “Adaptive Histogram Equalization and Its Variations,” Comput. Vision, Graph., Image Process., vol. 39, no. 3, pp. 355–368, Sep. 1987, doi: 10.1016/S0734-189X(87)80186-X. [9] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo, “CutMix: Regularization Strategy to Train Strong Classifiers With Localizable Features,” *2019 IEEE/CVF Int. Conf. Comput. Vis. (ICCV)*, Seoul, Korea (South), 2019, pp. 6022–6031, doi: 10.1109/ICCV.2019.00612. [10] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond Empirical Risk Minimization,” 6th Int. Conf. Learn. Represent. (ICLR), Vancouver, BC, Canada, 2018. [11] T. Baltrusaitis, A. Zadeh, Y. C. Lim, and L.-P. Morency, “OpenFace 2.0: Facial Behavior Analysis Toolkit,” 2018 13th IEEE Int. Conf. Autom. Face Gesture Recognit. (FG 2018), Xi\'an, China, 2018, pp. 59–66, doi: 10.1109/FG.2018.00019. [12] A. Paszke et al., “PyTorch: An Imperative Style, High-Performance Deep Learning Library,” in Adv. Neural Inf. Process. Syst., vol. 32, Curran Associates, Inc., 2019, pp. 8024–8035. [13] NVIDIA Corporation, “NVIDIA TensorRT: Programmable Inference Accelerator,” [Online]. Available: https://developer.nvidia.com/tensorrt

Copyright

Copyright © 2025 Ankit Katakwar, Govind Singh. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Download Paper

Paper Id : IJRASET74284

Publish Date : 2025-09-18

ISSN : 2321-9653

Publisher Name : IJRASET

DOI Link : Click Here